Welcome to my Open Data Science project page. Each chapter analyses a different (open) dataset with different multivariate statistical methods. See also my GitHub repository: https://github.com/paikalla/IODS-project
# Setting up
library(car)
library(GGally)
library(ggplot2)
# Data
learning2014 <- read.csv("./data/learning/learning2014.csv", stringsAsFactors = TRUE)
str(learning2014)
## 'data.frame': 166 obs. of 7 variables:
## $ gender : Factor w/ 2 levels "F","M": 1 2 1 2 2 1 2 1 2 1 ...
## $ age : int 53 55 49 53 49 38 50 37 37 42 ...
## $ attitude: num 3.7 3.1 2.5 3.5 3.7 3.8 3.5 2.9 3.8 2.1 ...
## $ deep : num 3.58 2.92 3.5 3.5 3.67 ...
## $ stra : num 3.38 2.75 3.62 3.12 3.62 ...
## $ surf : num 2.58 3.17 2.25 2.25 2.83 ...
## $ points : int 25 12 24 10 22 21 21 31 24 26 ...
ggpairs(learning2014, mapping = aes(col=gender,alpha=0.3), lower = list(combo = wrap("facethist", bins = 20)), upper = list(continuous = wrap("cor", size = 2.5)))
Based on the correlations of the variable points with the other variables, we choose attitude, stra, and surf as explanatory variables.
Let's fit a linear model with these three explanatory variables.
my_model <- lm(points ~ attitude + stra + surf, data = learning2014)
summary(my_model)
##
## Call:
## lm(formula = points ~ attitude + stra + surf, data = learning2014)
##
## Residuals:
## Min 1Q Median 3Q Max
## -17.1550 -3.4346 0.5156 3.6401 10.8952
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 11.0171 3.6837 2.991 0.00322 **
## attitude 3.3952 0.5741 5.913 1.93e-08 ***
## stra 0.8531 0.5416 1.575 0.11716
## surf -0.5861 0.8014 -0.731 0.46563
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.296 on 162 degrees of freedom
## Multiple R-squared: 0.2074, Adjusted R-squared: 0.1927
## F-statistic: 14.13 on 3 and 162 DF, p-value: 3.156e-08
As shown in the model summary, the intercept (11.0171) is the expected exam score when all explanatory variables are zero. According to the model, a one-point increase in general attitude towards statistics increases the expected exam points by about 3.4. This estimated effect is statistically highly significant (p < 0.001): under the null hypothesis of no effect (β = 0), an estimate as large as 3.3952 would be very unlikely.
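The uncertainty around such point estimates can be quantified with confint(). A minimal sketch with simulated toy data (standing in for learning2014 so the snippet is self-contained):

```r
# 95% confidence interval for a slope estimate (toy data, true slope = 3)
set.seed(1)
toy <- data.frame(x = rnorm(100))
toy$y <- 10 + 3 * toy$x + rnorm(100)
fit <- lm(y ~ x, data = toy)
confint(fit, "x", level = 0.95)  # the interval should cover the true slope
```

With the actual model, confint(my_model, "attitude") would give the corresponding interval for the attitude coefficient.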
As the other variables are not statistically significant, let's adjust the model by keeping only attitude as an explanatory variable:
fit2 <- lm(points ~ attitude, data = learning2014)
summary(fit2)
##
## Call:
## lm(formula = points ~ attitude, data = learning2014)
##
## Residuals:
## Min 1Q Median 3Q Max
## -16.9763 -3.2119 0.4339 4.1534 10.6645
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 11.6372 1.8303 6.358 1.95e-09 ***
## attitude 3.5255 0.5674 6.214 4.12e-09 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.32 on 164 degrees of freedom
## Multiple R-squared: 0.1906, Adjusted R-squared: 0.1856
## F-statistic: 38.61 on 1 and 164 DF, p-value: 4.119e-09
Now the estimated effect of a one-point increase in attitude is an increase of about 3.5 exam points. The estimated effect is highly significant (p < 0.001). However, the R-squared is only 0.1906, so the explanatory power of the model is low: attitude alone explains only 19.06% of the variation in exam points. The better the model fits the data, the closer the R-squared is to one.
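In simple linear regression, R-squared is exactly the squared correlation between the response and the predictor, which is a useful sanity check; a sketch with toy data:

```r
# R^2 equals cor(y, x)^2 when there is a single predictor
set.seed(2)
d <- data.frame(x = rnorm(50))
d$y <- 2 * d$x + rnorm(50)
fit <- lm(y ~ x, data = d)
r2 <- summary(fit)$r.squared
stopifnot(all.equal(r2, cor(d$y, d$x)^2))
```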
The following diagnostic plots help to evaluate the validity of our linear model assumptions. A linear regression model assumes that
errors (residuals) are normally distributed
homoscedasticity: errors have mean zero and constant variance
the relationship between the response and the explanatory variables is linear
observations are independent (in contrast to, e.g., dependent observations in time series)
all relevant explanatory variables are included
there is no or little multicollinearity (checked with vif(), the Variance Inflation Factor: a VIF of 1 indicates no multicollinearity for that variable, while a VIF above 5 or 10 indicates a problem)
explanatory variables are uncorrelated with the error term
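The VIF check mentioned above can be illustrated with car::vif() on simulated data where one predictor is nearly a copy of another (a hedged sketch, not the course data):

```r
library(car)  # provides vif()

set.seed(3)
d <- data.frame(x1 = rnorm(100))
d$x2 <- d$x1 + rnorm(100, sd = 0.1)  # almost collinear with x1
d$x3 <- rnorm(100)                   # independent predictor
d$y  <- d$x1 + d$x3 + rnorm(100)
vif(lm(y ~ x1 + x2 + x3, data = d))  # x1 and x2 get large VIFs, x3 stays near 1
```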
par(mfrow=c(2,2))
plot(fit2, which=c(1,2,5))
hist(fit2$residuals) # normality
Residuals vs Fitted values
This plot lets us inspect the assumptions of zero mean and constant variance. The residuals are scattered around zero and their spread does not depend on the fitted values (although there are a few large negative residuals at fitted values of 24 or above). With no clear patterns in the scatter plot, the model seems to meet the homoscedasticity assumption.
Normal QQ-plot
The normal QQ plot checks whether the assumption that errors are normally distributed is met. If the residuals are normally distributed, they should lie on a straight line in the QQ plot. This is approximately the case here, except for some deviations at both ends of the distribution.
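A formal test can complement the visual check; the Shapiro-Wilk test is one common option (not part of the original analysis, shown here on a toy model):

```r
# Shapiro-Wilk test of residual normality; a large p-value means
# no evidence against the normality assumption
set.seed(5)
d <- data.frame(x = rnorm(80))
d$y <- 1 + 2 * d$x + rnorm(80)
fit <- lm(y ~ x, data = d)
shapiro.test(residuals(fit))
```

With the actual model this would be shapiro.test(residuals(fit2)).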
Residuals vs Leverage
The Residuals vs Leverage plot makes it possible to check whether some observations have a high influence on the model. Data points with large residuals (outliers) and/or high leverage can distort the results and accuracy of a regression. Here, no observation appears to have an excessively large influence on the model fit.
library(tidyverse)
library(ggplot2)
alc_data <- read.csv("./data/student/pormath.csv", stringsAsFactors = TRUE)
dim(alc_data)
## [1] 370 51
colnames(alc_data)
## [1] "school" "sex" "age" "address" "famsize"
## [6] "Pstatus" "Medu" "Fedu" "Mjob" "Fjob"
## [11] "reason" "guardian" "traveltime" "studytime" "schoolsup"
## [16] "famsup" "activities" "nursery" "higher" "internet"
## [21] "romantic" "famrel" "freetime" "goout" "Dalc"
## [26] "Walc" "health" "n" "id.p" "id.m"
## [31] "failures" "paid" "absences" "G1" "G2"
## [36] "G3" "failures.p" "paid.p" "absences.p" "G1.p"
## [41] "G2.p" "G3.p" "failures.m" "paid.m" "absences.m"
## [46] "G1.m" "G2.m" "G3.m" "alc_use" "high_use"
## [51] "cid"
Graphs.
# keep only variables of interest
alc <- alc_data %>% select(high_use,sex,famrel,absences,G3, goout, alc_use)
summary(select(alc_data, G3, absences, famrel, sex, high_use, goout))
## G3 absences famrel sex high_use
## Min. : 0.00 Min. : 0.000 Min. :1.000 F:195 Mode :logical
## 1st Qu.:10.00 1st Qu.: 1.000 1st Qu.:4.000 M:175 FALSE:259
## Median :12.00 Median : 3.000 Median :4.000 TRUE :111
## Mean :11.52 Mean : 4.511 Mean :3.935
## 3rd Qu.:14.00 3rd Qu.: 6.000 3rd Qu.:5.000
## Max. :18.00 Max. :45.000 Max. :5.000
## goout
## Min. :1.000
## 1st Qu.:2.000
## Median :3.000
## Mean :3.116
## 3rd Qu.:4.000
## Max. :5.000
gather(alc) %>% ggplot(aes(value)) + facet_wrap("key", scales = "free") + geom_bar()
# or
p_g3 <- ggplot(data=alc_data,aes(x=high_use,y=G3,fill=high_use))
p_g3 + geom_boxplot() + ylab("Final grade")
p_going_out <- ggplot(data=alc_data,aes(x=goout,fill=high_use))
p_going_out + geom_bar() + ylab("Going out")
p_studytime <- ggplot(data=alc_data,aes(x=studytime,fill=high_use))
p_studytime + geom_bar()
p_internet <- ggplot(data=alc_data,aes(x=internet,fill=high_use))
p_internet + geom_bar()
Crosstables
table1 <- table(alc_data$sex, alc_data$high_use,dnn=c("sex","high_alc"))
addmargins(round(prop.table(table1)*100,1))
## high_alc
## sex FALSE TRUE Sum
## F 41.6 11.1 52.7
## M 28.4 18.9 47.3
## Sum 70.0 30.0 100.0
table2 <- table(alc_data$famrel, alc_data$high_use,dnn=c("famrel","high_alc"))
addmargins(round(prop.table(table2)*100,1))
## high_alc
## famrel FALSE TRUE Sum
## 1 1.6 0.5 2.1
## 2 2.4 2.4 4.8
## 3 10.5 6.8 17.3
## 4 34.6 14.1 48.7
## 5 20.8 6.2 27.0
## Sum 69.9 30.0 99.9
table(alc_data$goout, alc_data$high_use)
##
## FALSE TRUE
## 1 19 3
## 2 82 15
## 3 97 23
## 4 40 38
## 5 21 32
Boxplots
The box plots show the association of the continuous variables final grade (G3) and absences with alcohol consumption, drawn separately for females and males.
# initialize a plot of high_use and G3, grouping by sex
g1 <- ggplot(alc, aes(x = high_use, y = G3,col=sex))
# define the plot as a boxplot and draw it
g1 + geom_boxplot() + ylab("grade") +
ggtitle("Student grades by alcohol consumption and sex")
logistic <- glm(high_use ~ famrel + absences + G3 + goout + sex, data = alc, family = "binomial")
# print out a summary of the model
summary(logistic)
##
## Call:
## glm(formula = high_use ~ famrel + absences + G3 + goout + sex,
## family = "binomial", data = alc)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.7675 -0.7622 -0.4864 0.7441 2.6334
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -2.19616 0.83823 -2.620 0.00879 **
## famrel -0.41131 0.14192 -2.898 0.00375 **
## absences 0.08020 0.02229 3.598 0.00032 ***
## G3 -0.04008 0.04048 -0.990 0.32213
## goout 0.74913 0.12564 5.963 2.48e-09 ***
## sexM 1.07181 0.26467 4.050 5.13e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 452.04 on 369 degrees of freedom
## Residual deviance: 364.33 on 364 degrees of freedom
## AIC: 376.33
##
## Number of Fisher Scoring iterations: 4
better_logistic <- glm(high_use ~ famrel + absences + goout + sex, data = alc, family = "binomial")
Odds ratios
# compute odds ratios (OR)
OR <- exp(coef(logistic))
# compute confidence intervals (CI)
CI <- logistic %>% confint %>% exp
## Waiting for profiling to be done...
# print out the odds ratios with their confidence intervals
cbind(OR, CI)
## OR 2.5 % 97.5 %
## (Intercept) 0.1112294 0.02062473 0.5564523
## famrel 0.6627789 0.50005846 0.8740026
## absences 1.0835025 1.03796942 1.1344714
## G3 0.9607171 0.88730625 1.0403990
## goout 2.1151573 1.66463216 2.7274827
## sexM 2.9206466 1.75080525 4.9533931
OR2 <- exp(coef(better_logistic))
# compute confidence intervals (CI)
CI2 <- better_logistic %>% confint %>% exp
## Waiting for profiling to be done...
# print out the odds ratios with their confidence intervals
cbind(OR2, CI2)
## OR2 2.5 % 97.5 %
## (Intercept) 0.06646973 0.01731071 0.2381125
## famrel 0.65884637 0.49727062 0.8681721
## absences 1.08565191 1.04010669 1.1366801
## goout 2.15919501 1.70320000 2.7776306
## sexM 2.94291184 1.76506498 4.9902818
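As a sanity check, exp(coef) really is the multiplicative change in the odds for a one-unit increase in a predictor; a self-contained sketch with simulated data:

```r
# the ratio of predicted odds at x + 1 vs x equals exp(beta)
set.seed(4)
d <- data.frame(x = rnorm(500))
d$y <- rbinom(500, 1, plogis(-1 + 0.8 * d$x))
fit <- glm(y ~ x, data = d, family = "binomial")
odds_at <- function(x) {
  p <- predict(fit, newdata = data.frame(x = x), type = "response")
  p / (1 - p)
}
ratio <- unname(odds_at(1) / odds_at(0))
stopifnot(all.equal(ratio, unname(exp(coef(fit)["x"]))))
```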
#predict() the probability of high_use
probabilities <- predict(logistic, type = "response")
# add the predicted probabilities to 'alc'
alc <- mutate(alc, probability = probabilities)
# use the probabilities to make a prediction of high_use
alc <- mutate(alc, prediction = probability > 0.5)
# tabulate the target variable versus the predictions
table(high_use = alc$high_use, prediction = alc$prediction) %>% prop.table %>% addmargins()
## prediction
## high_use FALSE TRUE Sum
## FALSE 0.65675676 0.04324324 0.70000000
## TRUE 0.15675676 0.14324324 0.30000000
## Sum 0.81351351 0.18648649 1.00000000
I conduct a simple model validation by tabulating the actual outcomes against the model predictions. The proportion of incorrectly classified students is around 20 percent (15.7% + 4.3%). Thus, I would next search for additional explanatory variables.
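The misclassification rate quoted above is one minus the sum of the diagonal of the proportion table; a tiny sketch:

```r
# off-diagonal share of the confusion matrix = misclassification rate
actual    <- c(TRUE, TRUE, FALSE, FALSE, FALSE)
predicted <- c(TRUE, FALSE, FALSE, FALSE, TRUE)
tab <- table(actual, predicted)
err <- 1 - sum(diag(prop.table(tab)))
err  # 2 of the 5 predictions are wrong, so 0.4
```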
Loss function:
# define a loss function (mean prediction error)
loss_func <- function(class, prob) {
n_wrong <- abs(class - prob) > 0.5
mean(n_wrong)
}
# call loss_func to compute the average number of wrong predictions in the (training) data
loss_func(class = alc$high_use, prob = alc$probability)
## [1] 0.2
Cross validation:
# K-fold cross-validation
library(boot)
##
## Attaching package: 'boot'
## The following object is masked from 'package:car':
##
## logit
cv <- cv.glm(data = alc, cost = loss_func, glmfit = logistic, K = 10)
# average number of wrong predictions in the cross validation
cv$delta[1]
## [1] 0.2054054
alc2 <- alc_data %>% select(G3, goout, studytime, internet, high_use, absences, health, nursery, romantic, sex)
logistic2 <- glm(high_use ~ internet + nursery + goout + sex, data = alc2, family = "binomial")
summary(logistic2)
##
## Call:
## glm(formula = high_use ~ internet + nursery + goout + sex, family = "binomial",
## data = alc2)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.7162 -0.8349 -0.5965 0.8533 2.5304
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -3.5003 0.5761 -6.076 1.24e-09 ***
## internetyes -0.0319 0.3549 -0.090 0.928387
## nurseryyes -0.3894 0.3055 -1.275 0.202418
## goout 0.7618 0.1198 6.361 2.01e-10 ***
## sexM 0.9357 0.2519 3.715 0.000203 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 452.04 on 369 degrees of freedom
## Residual deviance: 387.34 on 365 degrees of freedom
## AIC: 397.34
##
## Number of Fisher Scoring iterations: 4
#predict() the probability of high_use
probabilities2 <- predict(logistic2, type = "response")
# add the predicted probabilities to 'alc'
alc2 <- mutate(alc2, probability = probabilities2)
# use the probabilities to make a prediction of high_use
alc2 <- mutate(alc2, prediction = probability > 0.5)
loss_func(class = alc2$high_use, prob = alc2$probability)
## [1] 0.2189189
cv2 <- cv.glm(data = alc2, cost = loss_func, glmfit = logistic2, K = 5)
# average number of wrong predictions in the cross validation
cv2$delta[1]
## [1] 0.2432432
library(MASS)
library(GGally)
library(ggplot2)
library(tidyverse)
library(corrplot)
head(Boston, n = 4)
## crim zn indus chas nox rm age dis rad tax ptratio black lstat
## 1 0.00632 18 2.31 0 0.538 6.575 65.2 4.0900 1 296 15.3 396.90 4.98
## 2 0.02731 0 7.07 0 0.469 6.421 78.9 4.9671 2 242 17.8 396.90 9.14
## 3 0.02729 0 7.07 0 0.469 7.185 61.1 4.9671 2 242 17.8 392.83 4.03
## 4 0.03237 0 2.18 0 0.458 6.998 45.8 6.0622 3 222 18.7 394.63 2.94
## medv
## 1 24.0
## 2 21.6
## 3 34.7
## 4 33.4
dim(Boston)
## [1] 506 14
summary(Boston)
## crim zn indus chas
## Min. : 0.00632 Min. : 0.00 Min. : 0.46 Min. :0.00000
## 1st Qu.: 0.08205 1st Qu.: 0.00 1st Qu.: 5.19 1st Qu.:0.00000
## Median : 0.25651 Median : 0.00 Median : 9.69 Median :0.00000
## Mean : 3.61352 Mean : 11.36 Mean :11.14 Mean :0.06917
## 3rd Qu.: 3.67708 3rd Qu.: 12.50 3rd Qu.:18.10 3rd Qu.:0.00000
## Max. :88.97620 Max. :100.00 Max. :27.74 Max. :1.00000
## nox rm age dis
## Min. :0.3850 Min. :3.561 Min. : 2.90 Min. : 1.130
## 1st Qu.:0.4490 1st Qu.:5.886 1st Qu.: 45.02 1st Qu.: 2.100
## Median :0.5380 Median :6.208 Median : 77.50 Median : 3.207
## Mean :0.5547 Mean :6.285 Mean : 68.57 Mean : 3.795
## 3rd Qu.:0.6240 3rd Qu.:6.623 3rd Qu.: 94.08 3rd Qu.: 5.188
## Max. :0.8710 Max. :8.780 Max. :100.00 Max. :12.127
## rad tax ptratio black
## Min. : 1.000 Min. :187.0 Min. :12.60 Min. : 0.32
## 1st Qu.: 4.000 1st Qu.:279.0 1st Qu.:17.40 1st Qu.:375.38
## Median : 5.000 Median :330.0 Median :19.05 Median :391.44
## Mean : 9.549 Mean :408.2 Mean :18.46 Mean :356.67
## 3rd Qu.:24.000 3rd Qu.:666.0 3rd Qu.:20.20 3rd Qu.:396.23
## Max. :24.000 Max. :711.0 Max. :22.00 Max. :396.90
## lstat medv
## Min. : 1.73 Min. : 5.00
## 1st Qu.: 6.95 1st Qu.:17.02
## Median :11.36 Median :21.20
## Mean :12.65 Mean :22.53
## 3rd Qu.:16.95 3rd Qu.:25.00
## Max. :37.97 Max. :50.00
The variables are measured on quite different scales.
# plot distributions and correlations of continuous variables
p <- Boston %>%
ggpairs(
mapping = aes(alpha=0.5),
lower = list(continuous = wrap("points", colour = "cornflower blue", size=0.3)),
upper = list(continuous = wrap("cor", size = 2)),
diag = list(continuous = wrap("densityDiag", colour = "cornflower blue"))) +
theme(axis.text.x = element_text(size=5),
axis.text.y = element_text(size=5))
p
Only a few variables are even somewhat normally distributed, rm (the average number of rooms per dwelling) being the clearest case. Most variables are skewed and/or bimodal. Let's also check the correlations.
cor_matrix <- round(cor(Boston), 2)
corrplot(cor_matrix, tl.cex = 0.7)
There are some negative correlations, especially for dis (distance from Boston employment centres): for example, dis correlates negatively with age. Tax has a strong positive correlation with several variables, e.g. rad (accessibility to radial highways). Beyond that, it is difficult to say much at this stage, since there are no other clear patterns in the data.
Since the variables are on such different scales (which is bad for clustering), we need to scale the data:
boston_sc <- scale(Boston)
boston_sc <- as.data.frame(boston_sc) # scale() outputs a matrix, so we convert it back into a data frame
summary(boston_sc)
## crim zn indus chas
## Min. :-0.419367 Min. :-0.48724 Min. :-1.5563 Min. :-0.2723
## 1st Qu.:-0.410563 1st Qu.:-0.48724 1st Qu.:-0.8668 1st Qu.:-0.2723
## Median :-0.390280 Median :-0.48724 Median :-0.2109 Median :-0.2723
## Mean : 0.000000 Mean : 0.00000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.007389 3rd Qu.: 0.04872 3rd Qu.: 1.0150 3rd Qu.:-0.2723
## Max. : 9.924110 Max. : 3.80047 Max. : 2.4202 Max. : 3.6648
## nox rm age dis
## Min. :-1.4644 Min. :-3.8764 Min. :-2.3331 Min. :-1.2658
## 1st Qu.:-0.9121 1st Qu.:-0.5681 1st Qu.:-0.8366 1st Qu.:-0.8049
## Median :-0.1441 Median :-0.1084 Median : 0.3171 Median :-0.2790
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.5981 3rd Qu.: 0.4823 3rd Qu.: 0.9059 3rd Qu.: 0.6617
## Max. : 2.7296 Max. : 3.5515 Max. : 1.1164 Max. : 3.9566
## rad tax ptratio black
## Min. :-0.9819 Min. :-1.3127 Min. :-2.7047 Min. :-3.9033
## 1st Qu.:-0.6373 1st Qu.:-0.7668 1st Qu.:-0.4876 1st Qu.: 0.2049
## Median :-0.5225 Median :-0.4642 Median : 0.2746 Median : 0.3808
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 1.6596 3rd Qu.: 1.5294 3rd Qu.: 0.8058 3rd Qu.: 0.4332
## Max. : 1.6596 Max. : 1.7964 Max. : 1.6372 Max. : 0.4406
## lstat medv
## Min. :-1.5296 Min. :-1.9063
## 1st Qu.:-0.7986 1st Qu.:-0.5989
## Median :-0.1811 Median :-0.1449
## Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.6024 3rd Qu.: 0.2683
## Max. : 3.5453 Max. : 2.9865
sd(boston_sc$dis)
## [1] 1
Scaling centred every variable to mean 0 and unit variance, which makes the variables much more comparable: for example, tax originally ranged from 187.0 to 711.0, and after scaling it ranges from -1.3127 to 1.7964. This step makes sense when we are interested in the distances between the cases described in this dataset.
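What scale() does can be verified by hand: each column is centred by its mean and divided by its standard deviation. A tiny sketch with made-up tax-like values:

```r
# manual standardization matches scale()
x <- c(187, 279, 330, 666, 711)    # made-up values on a tax-like scale
manual <- (x - mean(x)) / sd(x)
stopifnot(all.equal(manual, as.vector(scale(x))))
```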
Then we will create a categorical variable called crime, based on the quantiles of the crim variable:
brks <- quantile(boston_sc$crim)
lbls <- c("low","med_low","med_high","high")
crime <- cut(boston_sc$crim, breaks = brks, label = lbls, include.lowest = TRUE)
boston_sc$crime <- crime
boston_sc <- dplyr::select(boston_sc, -crim) # Remove the old crim variable
summary(boston_sc$crime)
## low med_low med_high high
## 127 126 126 127
Then let's split the data into a training set (80%) and a test set (20%):
n <- nrow(boston_sc)
ind <- sample(n, size=n*0.8)
train_set <- boston_sc[ind,]
test_set <- boston_sc[-ind,]
Now we have two sets: train_set contains a randomly chosen 80% of the cases of the Boston data, and test_set contains the remaining 20%.
Next I estimate a linear discriminant model with crime as the target variable. The purpose of the analysis is to identify those variables that explain whether a tract has a high or low crime rate. We have here a four-class grouping of crime rate.
A linear discriminant model assumes that the explanatory variables are continuous and normally distributed within each class defined by the target variable. Moreover, the classes are assumed to share the same covariance structure. According to the preliminary analysis, the normality assumption is not satisfied; I do not check normality within each crime class but simply assume it. Scaling gives every variable unit variance overall, which helps with the common-covariance assumption, although it does not guarantee it.
The results of the linear discriminant analysis are shown below. The prior probabilities are the proportions of observations belonging to the four groups in the training data. They are not exactly equal because the grouping was done on all 506 tracts. The group means differ across crime groups, suggesting that the variables are associated with the crime rate.
The first linear discriminant explains about 96% of the variance between the groups (proportion of trace 0.9624).
lda_fit <- lda(crime ~ ., data = train_set)
lda_fit
## Call:
## lda(crime ~ ., data = train_set)
##
## Prior probabilities of groups:
## low med_low med_high high
## 0.2500000 0.2450495 0.2599010 0.2450495
##
## Group means:
## zn indus chas nox rm age
## low 0.97037019 -0.9651881 -0.116404306 -0.9252246 0.47155981 -0.9146792
## med_low -0.05976813 -0.2739222 0.006051757 -0.5773165 -0.09166025 -0.3374251
## med_high -0.38025152 0.1434859 0.140129052 0.3866161 0.04675566 0.4160315
## high -0.48724019 1.0149946 -0.033716932 1.0734229 -0.44471256 0.8109429
## dis rad tax ptratio black lstat
## low 0.9562978 -0.6862262 -0.7209948 -0.46148873 0.37654866 -0.77532277
## med_low 0.3510420 -0.5561264 -0.4471322 -0.07417466 0.30919137 -0.17155481
## med_high -0.3782580 -0.4579515 -0.3647014 -0.29135684 0.08966411 0.05697214
## high -0.8430128 1.6596029 1.5294129 0.80577843 -0.71377433 0.88890774
## medv
## low 0.55429260
## med_low 0.02026568
## med_high 0.12960118
## high -0.70460017
##
## Coefficients of linear discriminants:
## LD1 LD2 LD3
## zn 0.13130839 0.519779886 -0.87969228
## indus 0.01841655 -0.258214756 0.44146395
## chas 0.00501796 -0.006971278 0.11580557
## nox 0.31927340 -0.831506191 -1.42187713
## rm 0.01336716 -0.033781878 -0.16078294
## age 0.20142313 -0.310804913 -0.07038329
## dis -0.15396449 -0.034370496 0.03467908
## rad 4.05501718 0.778544224 -0.15543608
## tax -0.01289706 0.344238361 0.59219477
## ptratio 0.22967422 -0.049501812 -0.28478152
## black -0.11413656 0.046475067 0.12279414
## lstat 0.10078949 -0.192016962 0.18485136
## medv 0.01816022 -0.341646931 -0.23220188
##
## Proportion of trace:
## LD1 LD2 LD3
## 0.9624 0.0301 0.0075
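The "Proportion of trace" row can be reproduced from the singular values that lda() returns; a sketch on the built-in iris data (not the Boston split above):

```r
library(MASS)

# proportion of trace = squared singular values, normalised to sum to 1
fit <- lda(Species ~ ., data = iris)
prop_trace <- fit$svd^2 / sum(fit$svd^2)
round(prop_trace, 4)  # matches the "Proportion of trace" line in print(fit)
```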
The LDA biplot based on the estimated model is shown below. The observations are colored on the basis of their crime group. The arrows indicate that variable rad (index of accessibility to radial highways) is a strong predictor of linear discriminant 1, while variables nox (air pollution) and zn (prop. of residential land zoned for large lots) explain linear discriminant 2.
# the function for lda biplot arrows
lda_arrows <- function(x, myscale = 1, arrow_heads = 0.1, color = "red", tex = 0.75, choices = c(1,2)){
heads <- coef(x)
arrows(x0 = 0, y0 = 0,
x1 = myscale * heads[,choices[1]],
y1 = myscale * heads[,choices[2]], col=color, length = arrow_heads)
text(myscale * heads[,choices], labels = row.names(heads),
cex = tex, col=color, pos=3)
}
# target classes as numeric
classes <- as.numeric(train_set$crime)
colnames(train_set)
## [1] "zn" "indus" "chas" "nox" "rm" "age" "dis"
## [8] "rad" "tax" "ptratio" "black" "lstat" "medv" "crime"
# plot the lda results
plot(lda_fit, dimen = 2, col=classes, pch=classes)
lda_arrows(lda_fit, myscale = 2)
I use the test data to validate the model, i.e. to see whether the observations in the test data are correctly classified.
# save the correct classes from test data
correct_classes <- test_set$crime
# remove the crime variable from test data
test <- dplyr::select(test_set, -crime)
The confusion matrix below compares the correct classes with the predicted ones. In my run, the model classified 78 of the 102 test cases correctly (about 76%), although for the med_low class only 15 of 29 were right. Given that random guessing over four classes would succeed only about 25% of the time, this seems like a relatively good success rate. (The exact numbers vary with the random train/test split.)
# predict classes with test data
lda_pred <- predict(lda_fit, newdata = test)
# confusion matrix and accuracy
tab1 <- table(correct = correct_classes, predicted = lda_pred$class)
addmargins(tab1)
# note: accuracy must be computed on the table without margins,
# otherwise the grand total inflates both the diagonal and the denominator
accuracy1 <- sum(diag(tab1)) / sum(tab1)
accuracy1
As a final step, I run a K-means clustering analysis on the Boston data set. K-means divides the observations into a pre-defined number of clusters (K) by minimizing the distances of the observations to the cluster means (centroids). I first look at the distances between observations using a popular distance measure, the Euclidean distance.
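The Euclidean distance used by dist() is just the square root of the sum of squared coordinate differences; a minimal check:

```r
# dist() defaults to the Euclidean metric
a <- c(0, 3)
b <- c(4, 0)
d_manual <- sqrt(sum((a - b)^2))  # a 3-4-5 triangle, so the distance is 5
stopifnot(all.equal(d_manual, as.vector(dist(rbind(a, b)))))
```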
library(MASS)
data('Boston')
# center and standardize variables
boston_scaled <- scale(Boston)
# change the object to data frame
boston_scaled <- as.data.frame(boston_scaled)
# euclidean distance matrix
dist_eu <- dist(boston_scaled)
# look at the summary of the distances
summary(dist_eu)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.1343 3.4625 4.8241 4.9111 6.1863 14.3970
I first try two clusters (K = 2). The plots below suggest that the variables black and tax have a strong association with the clusters. Many of the other variables seem to have a somewhat weaker effect, too.
# k-means clustering on the scaled data
km <- kmeans(boston_scaled, centers = 2)
summary(km)
## Length Class Mode
## cluster 506 -none- numeric
## centers 28 -none- numeric
## totss 1 -none- numeric
## withinss 2 -none- numeric
## tot.withinss 1 -none- numeric
## betweenss 1 -none- numeric
## size 2 -none- numeric
## iter 1 -none- numeric
## ifault 1 -none- numeric
# plot the Boston dataset with clusters
pairs(boston_scaled[1:10], col = km$cluster)
pairs(boston_scaled[4:6], col = km$cluster)
pairs(boston_scaled[7:9], col = km$cluster)
pairs(boston_scaled[10:12], col = km$cluster)
pairs(boston_scaled[13:14], col = km$cluster)
I search for the optimal number of clusters K by inspecting how the total within-cluster sum of squares (total WCSS) changes as K changes, letting K run from 1 to 10. A good number of clusters is the value of K at which the total WCSS stops dropping rapidly (the "elbow").
The plot below shows that the optimal number of clusters is 2.
set.seed(123)
# determine the number of clusters
k_max <- 10
# calculate the total within sum of squares
twcss <- sapply(1:k_max, function(k){kmeans(boston_scaled, k)$tot.withinss})
# visualize the results
qplot(x = 1:k_max, y = twcss, geom = 'line')
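The quantity on the y-axis, tot.withinss, can be recomputed by hand as the sum of squared distances of each point to its cluster centroid; a sketch on small simulated data:

```r
# verify kmeans()$tot.withinss by direct computation
set.seed(6)
m <- matrix(rnorm(40), ncol = 2)
km <- kmeans(m, centers = 2)
manual <- sum(sapply(seq_len(2), function(k) {
  pts <- m[km$cluster == k, , drop = FALSE]
  sum(sweep(pts, 2, km$centers[k, ])^2)   # squared distances to centroid k
}))
stopifnot(all.equal(manual, km$tot.withinss))
```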
I choose K = 2 and re-run the K-means clustering algorithm. The plots below support dividing this data set into two clusters. (twcss = total within-cluster sum of squares.)
# k-means clustering
km <-kmeans(boston_scaled, centers = 2)
# plot the Boston dataset with 2 clusters
pairs(boston_scaled[1:6], col = km$cluster)
pairs(boston_scaled[7:14], col = km$cluster)
data(Boston)
set.seed(22)
boston_again <- scale(Boston)
boston_again <- as.data.frame(boston_again)
boston_km <- kmeans(boston_again,centers=4)
boston_again$cluster <- boston_km$cluster
boston_lda <- lda(cluster~., data=boston_again)
plot(boston_lda, dimen=2, col=boston_again$cluster)
lda_arrows(boston_lda, myscale=2)
It seems that the most influential individual variable is black. Other variables with a relatively strong influence are crim, indus, nox and tax. The variables black and crim "pull" in their own directions, while most of the other variables form a group that pulls to the left. The influence of black confirms earlier observations (e.g. part 7 of this exercise).
model_predictors <- dplyr::select(train_set, -crime)
# check the predictors
summary(model_predictors)
## zn indus chas nox
## Min. :-0.487240 Min. :-1.55630 Min. :-0.272329 Min. :-1.464433
## 1st Qu.:-0.487240 1st Qu.:-0.90255 1st Qu.:-0.272329 1st Qu.:-0.920756
## Median :-0.487240 Median :-0.37560 Median :-0.272329 Median :-0.195854
## Mean : 0.009721 Mean :-0.02241 Mean : 0.000539 Mean :-0.009254
## 3rd Qu.: 0.289908 3rd Qu.: 1.01499 3rd Qu.:-0.272329 3rd Qu.: 0.658496
## Max. : 3.800473 Max. : 2.42017 Max. : 3.664771 Max. : 2.729645
## rm age dis rad
## Min. :-3.876413 Min. :-2.222999 Min. :-1.26230 Min. :-0.98187
## 1st Qu.:-0.564866 1st Qu.:-0.810864 1st Qu.:-0.80209 1st Qu.:-0.63733
## Median :-0.081317 Median : 0.317068 Median :-0.22678 Median :-0.52248
## Mean :-0.001396 Mean :-0.004508 Mean : 0.02021 Mean :-0.02017
## 3rd Qu.: 0.472328 3rd Qu.: 0.914783 3rd Qu.: 0.70867 3rd Qu.: 1.65960
## Max. : 3.473251 Max. : 1.116390 Max. : 3.95660 Max. : 1.65960
## tax ptratio black lstat
## Min. :-1.312691 Min. :-2.70470 Min. :-3.9033 Min. :-1.503007
## 1st Qu.:-0.756434 1st Qu.:-0.49910 1st Qu.: 0.2132 1st Qu.:-0.827687
## Median :-0.464213 Median : 0.13602 Median : 0.3797 Median :-0.173372
## Mean :-0.009823 Mean :-0.01182 Mean : 0.0183 Mean :-0.003237
## 3rd Qu.: 1.529413 3rd Qu.: 0.80578 3rd Qu.: 0.4331 3rd Qu.: 0.607674
## Max. : 1.796416 Max. : 1.63721 Max. : 0.4406 Max. : 3.545262
## medv
## Min. :-1.906340
## 1st Qu.:-0.626046
## Median :-0.144916
## Mean : 0.004561
## 3rd Qu.: 0.268258
## Max. : 2.986505
summary(lda_fit$scaling)
## LD1 LD2 LD3
## Min. :-0.153964 Min. :-0.83151 Min. :-1.42188
## 1st Qu.: 0.005018 1st Qu.:-0.25821 1st Qu.:-0.23220
## Median : 0.018417 Median :-0.03437 Median :-0.07038
## Mean : 0.370112 Mean :-0.02844 Mean :-0.13180
## 3rd Qu.: 0.201423 3rd Qu.: 0.04648 3rd Qu.: 0.12279
## Max. : 4.055017 Max. : 0.77854 Max. : 0.59219
# matrix multiplication
matrix_product <- as.matrix(model_predictors) %*% lda_fit$scaling
matrix_product <- as.data.frame(matrix_product)
summary(matrix_product)
## LD1 LD2 LD3
## Min. :-5.17115 Min. :-3.647671 Min. :-3.271755
## 1st Qu.:-3.15326 1st Qu.:-0.626416 1st Qu.:-0.533163
## Median :-2.21145 Median :-0.042644 Median : 0.120865
## Mean :-0.09284 Mean : 0.000695 Mean :-0.002707
## 3rd Qu.: 6.53189 3rd Qu.: 0.722263 3rd Qu.: 0.587911
## Max. : 8.22523 Max. : 3.102531 Max. : 2.666062
# install.packages("plotly")
library(plotly)
plot_ly(x=matrix_product$LD1, y=matrix_product$LD2, z=matrix_product$LD3, type="scatter3d", mode="markers", color=classes, size=I(40))
train_km <- kmeans(train_set[,-14],centers=4)
cluster_col <- train_km$cluster
plot_ly(x=matrix_product$LD1, y=matrix_product$LD2, z=matrix_product$LD3, type="scatter3d", mode="markers", color=cluster_col, size=I(40))
The scatter patterns in these two plots resemble each other quite well. Based on them, the crime rate seems to play a relatively large role when the algorithm forms the clusters: the cluster colors are distributed very similarly to the crime-rate colors, even though crime itself was not used in the clustering.